Professional Certificate Capstone 1
Artificial Intelligence & Machine Learning

Overview: In this capstone, my goal is to do the heavy lifting of the project. I will explore my data source, test a few of the techniques, and come up with a solution or answer to my research problem.

Research Problem

Data Source

Modeling Techniques To Be Used:

Expected Results

Why This Question Is Important



Understanding the Data


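Our data source covers more than 10 years of clinical care at 130 US hospitals and integrated delivery networks. It includes over 40 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria:

  1. It is an inpatient encounter (a hospital admission).
  2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.
  3. The length of stay was at least 1 day and at most 14 days.
  4. Laboratory tests were performed during the encounter.
  5. Medications were administered during the encounter.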

The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetic medications, and the number of outpatient, inpatient, and emergency visits in the year before the hospitalization.



Prerequisites + Reading In The Data


Here we will use pandas to read in the dataset diabetic_data.csv and assign a meaningful variable name.
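A minimal sketch of the loading step follows. The helper name load_encounters and the variable diabetes_df are illustrative choices, not names from the original notebook, and the na_values="?" setting assumes that missing entries in the file are stored as a literal question mark. A tiny in-memory CSV stands in for the real file so the sketch runs on its own.

```python
import io

import pandas as pd


def load_encounters(source):
    """Read the encounter table; `source` may be a path or a file-like object."""
    # Assumption: missing values are stored as the literal string "?".
    return pd.read_csv(source, na_values="?")


# Tiny in-memory stand-in so the sketch runs without the real file.
sample_csv = io.StringIO(
    "race,gender,time_in_hospital,readmitted\n"
    "Caucasian,Female,3,NO\n"
    "?,Male,5,<30\n"
)

diabetes_df = load_encounters(sample_csv)
print(diabetes_df.shape)
print(diabetes_df.dtypes)
```

With the real file, the call would be load_encounters("diabetic_data.csv").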

The data is fairly clean to start with. A few columns need attention, which we handle below.



Understanding the Features


Here we will examine the data and determine whether any of the features are missing values or need to be coerced to a different data type. Some features can be dropped because they are almost entirely empty. Others may need to be cleaned up further.
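One possible way to run this check is sketched below on a toy frame. The helper names missing_summary and drop_sparse_columns, and the 40% cutoff for dropping a column, are assumptions made for illustration; the right threshold depends on the real data.

```python
import pandas as pd


def missing_summary(df):
    """Return the share of missing values and the dtype per column, worst first."""
    summary = pd.DataFrame({
        "missing_frac": df.isna().mean(),
        "dtype": df.dtypes.astype(str),
    })
    return summary.sort_values("missing_frac", ascending=False)


def drop_sparse_columns(df, max_missing_frac=0.4):
    """Keep only columns whose missing share is at or below the cutoff (assumed 40%)."""
    keep = df.columns[df.isna().mean() <= max_missing_frac]
    return df[keep]


# Toy stand-in for the real table; time_in_hospital is deliberately stored as text.
toy = pd.DataFrame({
    "race": ["Caucasian", None, "AfricanAmerican", "Caucasian"],
    "medical_specialty": [None, None, None, "Cardiology"],
    "time_in_hospital": ["3", "5", "2", "7"],
})

print(missing_summary(toy))

cleaned = drop_sparse_columns(toy)
# Coerce a numeric column that was read in as text.
cleaned = cleaned.assign(time_in_hospital=pd.to_numeric(cleaned["time_in_hospital"]))
print(cleaned.dtypes)
```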

Input Variables:

  1. race - The race of the patient
  2. gender - The gender of the patient (some marked as unknown)
  3. age - The age of the patient
  4. admission_type_id - Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available
  5. discharge_disposition_id - Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available
  6. admission_source_id - Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital
  7. time_in_hospital - Integer number of days between admission and discharge
  8. medical_specialty - Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon
  9. num_lab_procedures - Number of lab tests performed during the encounter
  10. num_procedures - Number of procedures (other than lab tests) performed during the encounter
  11. num_medications - Number of distinct generic names administered during the encounter
  12. number_outpatient - Number of outpatient visits of the patient in the year preceding the encounter
  13. number_emergency - Number of emergency visits of the patient in the year preceding the encounter
  14. number_inpatient - Number of inpatient visits of the patient in the year preceding the encounter
  15. diag_1 - The primary diagnosis (coded as first three digits of ICD9); 848 distinct values
  16. diag_2 - Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values
  17. diag_3 - Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values
  18. number_diagnoses - Number of diagnoses entered to the system
  19. max_glu_serum - Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured
  20. A1Cresult - Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured.
  21. metformin - medical drug
  22. repaglinide - medical drug
  23. nateglinide - medical drug
  24. chlorpropamide - medical drug
  25. glimepiride - medical drug
  26. acetohexamide - medical drug
  27. glipizide - medical drug
  28. glyburide - medical drug
  29. tolbutamide - medical drug
  30. pioglitazone - medical drug
  31. rosiglitazone - medical drug
  32. acarbose - medical drug
  33. miglitol - medical drug
  34. troglitazone - medical drug
  35. tolazamide - medical drug
  36. examide - medical drug
  37. citoglipton - medical drug
  38. insulin - medical drug
  39. glyburide-metformin - medical drug
  40. glipizide-metformin - medical drug
  41. glimepiride-pioglitazone - medical drug
  42. metformin-rosiglitazone - medical drug
  43. metformin-pioglitazone - medical drug
  44. change - Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change”
  45. diabetesMed - Indicates if there was any diabetic medication prescribed. Values: “yes” and “no”

Output variable (desired target):

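  1. readmitted - Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “NO” for no record of readmission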

Observations

Further Visuals


Observations




Understanding the Task

Identification of Business Goals



Engineering Features


Now that basic business goals and objectives have been stated, we will build a basic model to get started. Before we can do this, we must work to encode the data where the values are not already numerical. I will prepare the features and target column for modeling with appropriate encoding and transformations.
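The encoding step could look roughly like the sketch below, which one-hot encodes the non-numeric columns and standardizes the numeric ones with scikit-learn. The toy rows, the choice of columns, and the choice of OneHotEncoder plus StandardScaler are assumptions for illustration; the original notebook may encode differently.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows using column names from the feature list; the values are made up.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "A1Cresult": [">8", "normal", ">7", ">8"],
    "time_in_hospital": [3, 5, 2, 7],
    "num_medications": [10, 18, 7, 22],
    "readmitted": ["NO", "<30", ">30", "NO"],
})

X = toy.drop(columns="readmitted")
y = toy["readmitted"]  # scikit-learn classifiers accept string labels directly

categorical_cols = X.select_dtypes(exclude="number").columns.tolist()
numeric_cols = X.select_dtypes(include="number").columns.tolist()

preprocessor = ColumnTransformer([
    # handle_unknown="ignore" avoids errors on categories unseen during fitting.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", StandardScaler(), numeric_cols),
])

X_encoded = preprocessor.fit_transform(X)
print(X_encoded.shape)
```

In a full workflow the preprocessor would be fit on the training split only and then applied to the test split, so that no information from the test rows leaks into the scaling or the category list.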



Train/Test Split

With our data prepared, we split it into a training set and a test set.
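A small sketch of the split is shown below, using scikit-learn's train_test_split on made-up rows. The 25% test share, the fixed random_state, and the use of stratify (to keep the class mix of readmitted similar in both parts) are assumptions, not settings taken from the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows: 12 "NO", 4 "<30", 4 ">30".
toy = pd.DataFrame({
    "time_in_hospital": list(range(1, 21)),
    "readmitted": ["NO"] * 12 + ["<30"] * 4 + [">30"] * 4,
})

X = toy[["time_in_hospital"]]
y = toy["readmitted"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,   # assumed share held out for testing
    stratify=y,       # keep class proportions similar in both parts
    random_state=42,  # fixed seed so the split is repeatable
)

print(len(X_train), len(X_test))
print(y_test.value_counts())
```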



A Baseline Model


Before we build our first model, we want to establish a baseline.
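One common baseline, sketched here as an assumption about what the notebook does, is to always predict the most frequent class and record the resulting accuracy. Any real model should beat this number. The labels below are made up.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels; "NO" is the majority class in both parts.
y_train = np.array(["NO"] * 6 + [">30"] * 3 + ["<30"] * 1)
y_test = np.array(["NO"] * 3 + [">30"] * 1 + ["<30"] * 1)

# The dummy model ignores the features, so placeholder columns are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```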



A Simple Model


Here we will use Logistic Regression to build a basic model on our data.
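The sketch below fits a logistic regression inside a scikit-learn pipeline. The data is synthetic and the target rule is invented purely so the example runs; the scaling step and max_iter=1000 are assumptions rather than the notebook's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic features that reuse two column names from the feature list.
X = pd.DataFrame({
    "time_in_hospital": rng.integers(1, 15, size=n),
    "number_inpatient": rng.integers(0, 6, size=n),
})

# Invented rule, for illustration only: many prior inpatient visits -> "<30".
y = np.where(X["number_inpatient"] >= 3, "<30", "NO")

# Scaling first helps the solver converge; max_iter is raised as a precaution.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

print(model.classes_)
print(model.predict(X.head()))
```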



Score the Model


Here we look at the accuracy of this model.
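As a sketch of the scoring step, the example below computes accuracy and a confusion matrix with scikit-learn on hand-written labels (not real model output). The confusion matrix is included as an assumption about what is useful here: with an imbalanced target, accuracy alone can hide poor performance on the smaller classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels for illustration; not output from the real model.
y_true = ["NO", "NO", "<30", ">30", "NO", "<30"]
y_pred = ["NO", "NO", "NO", ">30", "NO", "<30"]

labels = ["<30", ">30", "NO"]  # fixes the row and column order of the matrix

accuracy = accuracy_score(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred, labels=labels)

print(f"Accuracy: {accuracy:.3f}")
print(matrix)  # rows are true classes, columns are predicted classes
```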



Model Comparisons




Improving the Model





Implementation of Counterfactual Explanations



Diverse Counterfactual Explanations (DiCE) for ML: https://github.com/interpretml/DiCE




Results of Exploratory Data Analysis